Training a hierarchical classifier using inter document relationships

Authors

  • Susan Gauch
  • Aravind Chandramouli
  • Shankar Ranganathan
Abstract

Concept hierarchies, also called taxonomies or directories, are widely used on the World Wide Web to organize and present large collections of Web pages. They were originally developed to help users locate relevant information by browsing. More recently, conceptual search engines such as KeyConcept have been developed that retrieve documents based on the concepts they discuss in addition to the keywords they contain. Both applications require that documents be classified into appropriate concepts in a conceptual hierarchy. Most classification approaches use flat classifiers that treat each concept as independent, even when the concept space is hierarchically structured. In contrast, hierarchical text classification exploits the structural relationships between the concepts. In this paper, we explore the effectiveness of hierarchical classification for a large concept hierarchy. Because the quality of the classification depends on the quality and quantity of the training data, we evaluate two techniques: using documents selected from subconcepts to address the sparseness of training data for the top-level classifiers, and using document relationships to identify the most representative training documents. By selecting training documents using structural and similarity relationships, we achieve a statistically significant relative improvement of 39.8% (from 54.5% to 76.2%) in the accuracy of our classifier over that of the flat classifier for a large, 3-level concept hierarchy.
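The top-down idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the toy two-level hierarchy, the term-frequency centroid classifier, and the names `HIERARCHY`, `train`, and `classify` are all assumptions made for illustration. The sketch shows one plausible reading of "documents selected from subconcepts": each top-level concept is trained on the pooled documents of its children, and a document is first assigned to a top-level concept and then to one of that concept's subconcepts.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(docs):
    """Sum of the term-frequency vectors of a list of documents."""
    total = Counter()
    for doc in docs:
        total.update(vectorize(doc))
    return total


# Hypothetical toy hierarchy: top-level concept -> subconcept -> training docs.
HIERARCHY = {
    "science": {
        "physics": ["quantum particle energy physics", "energy force motion particle"],
        "biology": ["cell gene protein organism", "gene species evolution cell"],
    },
    "sports": {
        "soccer": ["goal striker league match", "match penalty goal keeper"],
        "tennis": ["serve racket set match", "racket court serve ace"],
    },
}


def train(hierarchy):
    """Build one centroid per top-level concept and one per subconcept.

    Top-level concepts have no documents of their own in this toy data,
    so their centroids are built from documents pooled from subconcepts.
    """
    parents, children = {}, {}
    for parent, subs in hierarchy.items():
        pooled = [doc for docs in subs.values() for doc in docs]
        parents[parent] = centroid(pooled)
        children[parent] = {child: centroid(docs) for child, docs in subs.items()}
    return parents, children


def classify(text, parents, children):
    """Top-down classification: choose a parent, then a child of that parent."""
    vec = vectorize(text)
    parent = max(parents, key=lambda p: cosine(vec, parents[p]))
    child = max(children[parent], key=lambda c: cosine(vec, children[parent][c]))
    return parent, child


parents, children = train(HIERARCHY)
print(classify("the striker scored a late goal in the match", parents, children))
# ('sports', 'soccer')
```

A flat classifier, by contrast, would compare the document against all four subconcept centroids at once and ignore the parent level entirely. The reported 39.8% figure is a relative gain: (76.2 − 54.5) / 54.5 ≈ 0.398.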

Similar resources

A Hierarchical Classifier Applied to Multi-way Sentiment Detection

This paper considers the problem of document-level multi-way sentiment detection, proposing a hierarchical classifier algorithm that accounts for the inter-class similarity of tagged sentiment-bearing texts. This type of classifier also provides a natural mechanism for reducing the feature space of the problem. Our results show that this approach improves on state-of-the-art predictive performa...

bwbaugh : Hierarchical sentiment analysis with partial self-training

Using labeled Twitter training data from SemEval-2013, we train both a subjectivity classifier and a polarity classifier separately, and then combine the two into a single hierarchical classifier. Using additional unlabeled data that is believed to contain sentiment, we allow the polarity classifier to continue learning using self-training. The resulting system is capable of classifying a docum...

Collective document classification using explicit and implicit inter-document relationships

Information systems are transforming the ways in which people generate, store and share information. One consequence of this change is a massive increase in the quantity of digital content the average person needs to deal with. A large part of the information systems challenge is about finding intelligent ways to help users locate and analyse this information. One tool that is available to buil...

A hierarchical K-NN classifier for textual data

This paper presents a classifier that is based on a modified version of the well known K-Nearest Neighbors classifier (K-NN). The original K-NN classifier was adjusted to work with category representatives rather than training documents. Each category was represented by one document that was constructed by consulting all of its training documents and then applying feature selection so that only...

Collective Document Classification with Implicit Inter-document Semantic Relationships

This paper addresses the question of how document classifiers can exploit implicit information about document similarity to improve document classifier accuracy. We infer document similarity using simple n-gram overlap, and demonstrate that this improves overall document classification performance over two datasets. As part of this, we find that collective classification based on simple iterati...

Journal title:
  • JASIST

Volume 60  Issue 

Pages  -

Publication date 2009